Assignment 9: GBDT

Response Coding: Example

The response table is built only on the train dataset. For a category that is present in the test data but not in the train data, we encode it with default values. Ex: if our test data has State: D, then we encode it as [0.5, 0.05]
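The scheme above can be sketched in a few lines. This is a minimal illustration of response coding; the helper names `fit_response_table` and `transform_response` and the toy data are made up for the example, and the default `[0.5, 0.05]` follows the example above:

```python
import numpy as np

def fit_response_table(categories, labels):
    """Build, from TRAIN data only, a mapping:
    category -> [P(y=0 | category), P(y=1 | category)]."""
    categories = np.asarray(categories)
    labels = np.asarray(labels)
    table = {}
    for cat in np.unique(categories):
        mask = categories == cat
        p1 = labels[mask].mean()          # fraction of positive labels
        table[cat] = [1.0 - p1, p1]
    return table

def transform_response(categories, table, default=(0.5, 0.05)):
    """Encode categories; categories unseen in train get the default."""
    return np.array([table.get(c, list(default)) for c in categories])

# toy example
train_states = ["A", "A", "B", "B", "B"]
train_y      = [ 1,   0,   1,   1,   0 ]
table = fit_response_table(train_states, train_y)
encoded_test = transform_response(["A", "D"], table)  # "D" is unseen
```

Each categorical feature then contributes two probability columns instead of a one-hot block.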

  1. Apply GBDT on these feature sets
    • Set 1: categorical (instead of one-hot encoding, try response coding: use probability values) + numerical features + project_title (TFIDF) + preprocessed_essay (TFIDF) + sentiment score of essay (check the example below; include all 4 values as 4 features)
    • Set 2: categorical (instead of one-hot encoding, try response coding: use probability values) + numerical features + project_title (TFIDF W2V) + preprocessed_essay (TFIDF W2V)
  2. Hyperparameter tuning (consider any two hyperparameters)
    • Find the best hyperparameters, i.e. the ones that give the maximum AUC value
    • Find the best hyperparameters using k-fold cross validation (or a simple cross-validation set)
    • Use GridSearchCV or RandomizedSearchCV, or write your own for loops to do this task
  3. Representation of results
    • You need to plot the performance of the model on both the train data and the cross-validation data for each hyperparameter, as shown in the figure, with the X-axis as n_estimators, the Y-axis as max_depth, and the Z-axis as the AUC score. We have provided a notebook that explains how to create this 3D plot; you can find it in the same drive as 3d_scatter_plot.ipynb
    • or
    • You need to plot the performance of the model on both the train data and the cross-validation data for each hyperparameter, as shown in the figure, as a seaborn heat map with rows as n_estimators, columns as max_depth, and the value inside each cell representing the AUC score
    • You can choose either of the two plotting techniques: 3D plot or heat map
    • Once you have found the best hyperparameters, train your model with them, find the AUC on the test data, and plot the ROC curve on both train and test
    • Along with plotting the ROC curve, print the confusion matrix of predicted vs. original labels for the test data points

  4. You need to summarize the results at the end of the notebook; summarize them in table format
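As a rough sketch of steps 2 and 3 using the "write your own for loops" option over two hyperparameters (n_estimators and max_depth), with scikit-learn's `GradientBoostingClassifier` standing in for xgboost/lightgbm and synthetic data standing in for the featurized DonorsChoose matrices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# toy data standing in for the featurized train/CV matrices
X, y = make_classification(n_samples=500, random_state=42)
X_tr, X_cv, y_tr, y_cv = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

n_estimators_grid = [10, 50]
max_depth_grid = [2, 4]
results = {}  # (n_estimators, max_depth) -> (train AUC, CV AUC)
for n in n_estimators_grid:
    for d in max_depth_grid:
        clf = GradientBoostingClassifier(
            n_estimators=n, max_depth=d, random_state=42)
        clf.fit(X_tr, y_tr)
        auc_tr = roc_auc_score(y_tr, clf.predict_proba(X_tr)[:, 1])
        auc_cv = roc_auc_score(y_cv, clf.predict_proba(X_cv)[:, 1])
        results[(n, d)] = (auc_tr, auc_cv)

best = max(results, key=lambda k: results[k][1])  # pick by CV AUC
```

The `results` dict holds exactly the values you would feed into the 3D scatter plot or the seaborn heat map (rows `n_estimators`, columns `max_depth`, cell values AUC).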

1. GBDT (xgboost/lightgbm)

1.1 Loading Data

1.2 Splitting data into Train and cross validation (or test): Stratified Sampling
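A minimal sketch of a stratified split with scikit-learn; the toy 80/20 labels are illustrative, and in the assignment `X` and `y` come from the DonorsChoose data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data with an imbalanced 80/20 class ratio
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

# stratify=y preserves the class ratio in both splits
X_train, X_cv, y_train, y_cv = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
```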

1.3 Make Data Model Ready: encoding essay, and project_title
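A minimal TFIDF sketch for the text columns; the toy essays are illustrative. The key point is to fit the vectorizer on train only and reuse the same vocabulary for the CV/test split:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

essays_train = ["students need books",
                "my students love science",
                "help my classroom get books"]
essays_cv = ["students need science books"]

# fit on train only, then transform train and CV with the same vocabulary
vectorizer = TfidfVectorizer(min_df=1)
X_essay_train = vectorizer.fit_transform(essays_train)
X_essay_cv = vectorizer.transform(essays_cv)
```

The same pattern applies to project_title with a second vectorizer.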

1.4 Make Data Model Ready: encoding numerical, categorical features

1.4.1 encoding categorical features: School State

1.4.2 encoding categorical features: teacher_prefix

1.4.3 encoding categorical features: project_grade_category

1.4.4 encoding categorical features: project_subject_categories

1.4.5 encoding categorical features: project_subject_subcategories

1.4.6 encoding numerical features: Price
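One way to encode the price column is to standardize it, again fitting on train only; `StandardScaler` is one reasonable choice here (a `Normalizer` or min-max scaling would follow the same fit/transform pattern), and the toy prices are illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

price_train = np.array([[120.0], [35.5], [480.0]])
price_cv = np.array([[99.0]])

# fit the scaler on train only; apply the same scaling to CV/test
scaler = StandardScaler()
price_train_scaled = scaler.fit_transform(price_train)
price_cv_scaled = scaler.transform(price_cv)
```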

1.4.7 Sentiment analysis for essay

1.4.8 Concatenating all the features
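Since the TFIDF blocks are sparse and the numerical/response-coded blocks are dense, `scipy.sparse.hstack` is a convenient way to concatenate them; the random matrices below are stand-ins for the real feature blocks:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

rng = np.random.default_rng(0)
X_tfidf = csr_matrix(rng.random((5, 10)))  # sparse TFIDF block
X_price = rng.random((5, 1))               # scaled numerical feature
X_state = rng.random((5, 2))               # response-coded: [P(y=0), P(y=1)]

# hstack keeps the result sparse; all blocks must share the row count
X_all = hstack((X_tfidf, csr_matrix(X_price), csr_matrix(X_state))).tocsr()
```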

1.5 Applying Models on different kinds of featurization as mentioned in the instructions


Apply GBDT on different kinds of featurization as mentioned in the instructions.
For every model that you work on, make sure you do step 2 and step 3 of the instructions.

Applying TFIDF weighted W2V on essay and project_title for the Set 2 model

Make Data Model Ready: encoding essay, and project_title with TFIDF weighted W2V
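A minimal sketch of TFIDF-weighted W2V: each word's vector is weighted by its tf-idf score, and the weighted vectors are averaged per sentence. The 3-dimensional vectors and the IDF values below are toy stand-ins; in the assignment the vectors come from trained word2vec embeddings and the IDF values from a fitted TfidfVectorizer:

```python
import numpy as np

# toy 3-d "word2vec" vectors (real embeddings are e.g. 300-d)
w2v = {"students": np.array([0.1, 0.2, 0.3]),
       "books":    np.array([0.4, 0.1, 0.0]),
       "science":  np.array([0.2, 0.5, 0.1])}
# toy IDF values (real ones come from a fitted TfidfVectorizer)
idf = {"students": 1.0, "books": 2.0, "science": 1.5}

def tfidf_w2v(sentence):
    """Average of word vectors weighted by their tf-idf scores."""
    words = [w for w in sentence.split() if w in w2v and w in idf]
    if not words:
        return np.zeros(3)
    tf = {w: words.count(w) / len(words) for w in words}
    weights = np.array([tf[w] * idf[w] for w in words])
    vecs = np.array([w2v[w] for w in words])
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()

vec = tfidf_w2v("students need books")  # "need" has no vector, so skipped
```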

Concatenating all the features

Applying Models on different kinds of featurization as mentioned in the instructions

2. Summary